Super Tic Tac Toe Reinforcement Learning

August 8, 2024

Code

All of the code for this post can be found in this repository.

A Note

First of all, I'm sorry for not posting much this summer. I've been pretty wrapped up in some cool stuff.

Introduction

Super Tic Tac Toe is a variant of the classic game tic tac toe. Presh Talwalkar explains it well here. My goal with this short post is to design (but not run) a reinforcement learning procedure so that a Deep Q-Network (DQN) learns the optimal strategy for Super Tic Tac Toe. The structure of this post is as follows:

  1. Review of Reinforcement Learning
  2. Representing Super Tic Tac Toe Mathematically
  3. Learning Procedure

Note: I will not be training the model. I'm just exercising my understanding of reinforcement learning.

1. Review of Reinforcement Learning

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by performing actions in an environment to maximize some notion of cumulative reward. The key components of an RL system are:

  - Agent: the learner and decision maker (here, the player).
  - Environment: everything the agent interacts with (here, the Super Tic Tac Toe board).
  - State \( s \): a representation of the environment at a given time.
  - Action \( a \): a choice available to the agent in state \( s \).
  - Reward \( r \): a scalar signal the environment returns after each action.
  - Policy \( \pi \): a mapping from states to actions that the agent tries to learn.

The objective is to learn a policy \(\pi\) that maximizes the expected cumulative reward over time, often defined as the return \( R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \), where \( \gamma \) is the discount factor.
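To make the return concrete, here is a minimal sketch of computing a finite-horizon discounted return in Python; the reward sequence and discount value in the example are made up purely for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} over a finite episode."""
    R = 0.0
    # Iterate backwards so each step folds in the discounted future return.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: a short episode with a win (+1) on the final move.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```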

2. Representing Super Tic Tac Toe Mathematically

Super Tic Tac Toe is a more complex game comprising a \(3 \times 3\) grid where each cell itself is a \(3 \times 3\) tic tac toe board. Let's denote:

  - \( L_{ij} \): the small \(3 \times 3\) board in row \( i \), column \( j \) of the outer grid, with \( i, j \in \{0, 1, 2\} \).
  - Each cell of \( L_{ij} \): \(+1\) if marked by X, \(-1\) if marked by O, and \(0\) if empty.

The state \( s \) can be represented as a flattened array of size 81:

\[ s = \text{flatten}(\{L_{ij}\}_{i,j=0}^2) \]

The action \( a \) is choosing a cell in the \(3 \times 3\) grid of the designated small board. For simplicity, this can be represented as a single integer in the range \([0, 80]\), where \( a = 9b + c \), with \( b \) the index of the designated small board and \( c \) the cell index within it.
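Here is a minimal sketch of this representation in Python, assuming the \(\{+1, -1, 0\}\) cell encoding above; the names `SuperTTTState` and `decode_action` are my own and purely illustrative.

```python
import numpy as np

class SuperTTTState:
    """State of Super Tic Tac Toe: a 3x3 grid of 3x3 small boards."""

    def __init__(self):
        # boards[i, j] is the small board L_ij in row i, column j of the outer grid;
        # each cell holds +1 (X), -1 (O), or 0 (empty).
        self.boards = np.zeros((3, 3, 3, 3), dtype=np.int8)

    def to_vector(self):
        """Flatten {L_ij} into the length-81 state vector s."""
        return self.boards.reshape(81).astype(np.float32)

def decode_action(a):
    """Map a global action index a in [0, 80] to (small board, cell)."""
    board_index, cell_index = divmod(a, 9)
    board_rc = divmod(board_index, 3)  # (row, col) of the small board
    cell_rc = divmod(cell_index, 3)    # (row, col) inside that board
    return board_rc, cell_rc

# Example: action 40 is the centre cell of the centre board.
print(decode_action(40))  # ((1, 1), (1, 1))
```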

3. Learning Procedure

3.1 Q-Learning

Q-learning is a value-based RL method that learns the optimal action-value function \( Q(s, a) \), and hence the optimal action-selection policy, via the update rule:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

where:

  - \( \alpha \) is the learning rate,
  - \( r \) is the reward received after taking action \( a \) in state \( s \),
  - \( \gamma \) is the discount factor,
  - \( s' \) is the resulting next state, and
  - \( a' \) ranges over the actions available in \( s' \).
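To make the update concrete, here is a minimal tabular sketch in Python; the dictionary-based Q-table and the hyperparameter values are illustrative assumptions, separate from the DQN described next.

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

def q_update(state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) += alpha * (TD target - Q(s,a))."""
    # For a terminal next state there are no available actions, so the
    # bootstrap term defaults to zero.
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```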

3.2 Deep Q-Network

We use a neural network to approximate the Q-value function \( Q(s, a; \theta) \), where \( \theta \) are the weights of the network. The loss function for training the network is:

\[ L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] \]

where \( \theta^- \) are the weights of the target network, which are periodically updated to stabilize training.
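Here is a sketch of how this loss could be computed with PyTorch under the representation above (81-dimensional states, 81 actions); the network architecture and layer sizes are arbitrary illustrative choices, not a prescription.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Small MLP approximating Q(s, a; theta) for all 81 actions at once."""
    def __init__(self, state_dim=81, n_actions=81, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)  # shape: (batch, n_actions)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error over a minibatch (s, a, r, s', done) drawn from D."""
    s, a, r, s_next, done = batch  # a: int64 action indices, done: 0/1 floats
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():
        max_next = target_net(s_next).max(dim=1).values     # max_a' Q(s', a'; theta^-)
        target = r + gamma * (1 - done) * max_next
    return nn.functional.mse_loss(q_sa, target)
```

The `(1 - done)` factor simply drops the bootstrap term for terminal transitions, which the expectation over \( D \) in the loss above implicitly assumes.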

3.3 Training Procedure

  1. Initialization: Initialize replay memory \( D \) and Q-network \( Q(s, a; \theta) \).
  2. Episode Loop: